Gmail Explorer v2

By nickesc / N. Escobar

Welcome back to Gmail Explorer. Last time, we used the Gmail API to grab a list of every email I've ever received and sifted through it, patching up some missing data and pruning things we didn't want. Then we spent some time visualizing our preliminary data and analyzing what we saw. We're going to expand on that here and look more closely at some of the trends we saw, while also exploring some avenues we didn't get to go down in the last part. For a more comprehensive description of the dataset, please refer to the other parts.

I wasn't really interested in using Machine Learning in this project. I really don't care for ML; it gives me a bad feeling in the pit of my stomach, and I found very little value in it. I don't want to engage with that technology if I can avoid it. I also wasn't much interested in making predictions: what fun is predicting whether a message is going to be spam? I was much more interested in continuing to look at trends in my email history over time, and in just having fun finding out more about my email use and inbox through analysis of the data, which is what I spent the bulk of my time on.

First things first, let's get our imports and notebook formatting stuff out of the way:

Next, we load in our data.

There are a few transformations we're going to make to the dataset here:

First, we'll grab categorical columns for our four email addresses.
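Here's a minimal sketch of one plausible reading of that step. It assumes the recipient lives in a column named `to`, and it uses made-up placeholder addresses, since the real ones aren't shown here. The short codes JG, GD, NE and NM are the ones used throughout this write-up; the column name `address` and the helper `add_address_column` are just for illustration.

```python
import pandas as pd

# Placeholder addresses: assumptions for illustration, not the real inboxes.
ADDRESS_CODES = {
    "jg@example.com": "JG",
    "gd@example.com": "GD",
    "ne@example.edu": "NE",
    "nm@example.com": "NM",
}


def add_address_column(df: pd.DataFrame) -> pd.DataFrame:
    """Map each recipient to its short code and store it as a categorical."""
    out = df.copy()
    # Any recipient outside the four known addresses maps to NaN.
    codes = out["to"].str.lower().map(ADDRESS_CODES)
    out["address"] = pd.Categorical(codes, categories=list(ADDRESS_CODES.values()))
    return out


demo = add_address_column(
    pd.DataFrame({"to": ["JG@example.com", "ne@example.edu", "spoof@example.org"]})
)
```

A categorical column keeps the four codes in a fixed order, which is handy later when the same colors should mean the same inbox on every plot.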

Next, we'll do a similar thing for the labels; we check if the labels contain the regular strings, and assign True or False based on that.

Important to note is that emails can have any, all, or none of these labels.
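As a rough sketch of the label step: the code below assumes the labels are stored as one string per message in a column called `labels`, and it checks for Gmail's system label IDs as plain substrings. Exactly which labels the notebook checks, and the output column names, are my assumptions here.

```python
import pandas as pd

# Gmail system label IDs; which ones the notebook actually checks is an assumption.
LABELS = {
    "inbox": "INBOX",
    "important": "IMPORTANT",
    "spam": "SPAM",
    "personal": "CATEGORY_PERSONAL",
    "social": "CATEGORY_SOCIAL",
    "promotions": "CATEGORY_PROMOTIONS",
    "updates": "CATEGORY_UPDATES",
    "forums": "CATEGORY_FORUMS",
}


def add_label_flags(df: pd.DataFrame, labels_col: str = "labels") -> pd.DataFrame:
    """Add one True/False column per label, based on a plain substring check."""
    out = df.copy()
    text = out[labels_col].fillna("").astype(str)  # messages with no labels become ""
    for column, label_id in LABELS.items():
        out[column] = text.str.contains(label_id, regex=False)
    return out


flags = add_label_flags(
    pd.DataFrame(
        {"labels": ["['INBOX', 'IMPORTANT', 'CATEGORY_PERSONAL']", "['SPAM']", None]}
    )
)
```

Because each label gets its own independent column, a single message can be True in several of them at once, or in none.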

Last, we convert our subject and body contents to their character lengths.

My data, unmodified, is about 6 gigabytes. It's a bit of a pain to manage and load in every time, solely because the bodies and subjects of the messages are included. Without them, it's about 20 megabytes. In the next step we swap the email subject and body contents for a character count and end up with a much more manageable dataset. At the end, we export the smaller .csv so we don't have to spend ten minutes loading the file in every time.
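The swap itself can be sketched in a few lines. This assumes the text columns are named `subject` and `body`; the helper name `shrink` and the file name in the comment are hypothetical.

```python
import io

import pandas as pd


def shrink(df: pd.DataFrame, text_cols=("subject", "body")) -> pd.DataFrame:
    """Replace each text column with its character count."""
    out = df.copy()
    for col in text_cols:
        # Missing subjects or bodies count as zero characters.
        out[col] = out[col].fillna("").astype(str).str.len()
    return out


raw = pd.DataFrame(
    {"id": ["a", "b"], "subject": ["Hello", None], "body": ["x" * 1000, "short"]}
)
small = shrink(raw)

# In the notebook this would go to a file instead,
# e.g. small.to_csv("messages_small.csv", index=False)  (hypothetical file name)
buffer = io.StringIO()
small.to_csv(buffer, index=False)
```

Writing to an in-memory buffer here just keeps the example from touching the disk; the shape of the exported file is the same.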

I considered for a little while doing string analysis and some language processing on the bodies, but it's far outside my skills, and I didn't feel confident with it as I was exploring it.

And once we finally have that file loaded into the notebook, we'll convert our dateTime string to a datetime object, since it can't be stored like that in the .csv. We'll also make sure we're only looking at emails sent to one of those four addresses instead of any other that might be floating in the set (there are a few random ones), and we drop about 500 rows cluttering the data.
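Roughly, that cleanup might look like the sketch below. The `dateTime` column name comes from the text above; the `to` column, the placeholder addresses, and the helper name `clean` are assumptions.

```python
import pandas as pd

# Placeholder addresses: assumptions for illustration, not the real inboxes.
ADDRESSES = ["jg@example.com", "gd@example.com", "ne@example.edu", "nm@example.com"]


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Parse the timestamp strings and keep only mail sent to the known addresses."""
    out = df.copy()
    # Strings that can't be parsed become NaT instead of raising an error.
    out["dateTime"] = pd.to_datetime(out["dateTime"], errors="coerce")
    out = out[out["to"].isin(ADDRESSES)]
    return out.reset_index(drop=True)


loaded = pd.DataFrame(
    {
        "to": ["jg@example.com", "someone@else.example", "ne@example.edu"],
        "dateTime": ["2018-03-01 09:30:00", "2018-03-02 10:00:00", "2019-09-15 14:45:00"],
    }
)
messages = clean(loaded)
```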

Finally, we create a clean, new dataframe with all the message data.

Let's get this out of the way first:

This is a huge pairplot! It graphs everything -- to be perfectly honest, it's kind of awful to look at but I really like it, which is why I want to start with it. You also might notice we actually have not one but two huge pairplots! These are the continuation of the pairplots from the last part, but here we can also look at how all of our new columns fit into it. It's a little rough to look at, and the boolean columns really don't tell us too much from here, especially in terms of our color-coded plot, but it's a good starting point to get an idea of the data.

Figure: PairPlot of all numeric values (hugePairPlot)

Exploring our data more closely

We're looking at twenty-seven columns in our DataFrame:

This should look familiar: it's the distribution of our messages over time. We won't spend too long on this, as we've seen it and covered it in depth before. What we should remember, though, is the swap from JG as the primary address to GD, and the addition of NE, my school email, about a year later.

Important to note is that these are graphed over time instead of internalDate now. It makes it much clearer what we're actually looking at.

I also want to point out what happened sometime in 2018. I was really curious about this dip in JG: it goes way down briefly and then corrects itself. And it's right around the introduction of two emails; it's a lot more noticeable on the histogram than the KDE graph, and we'll look closer at that period in a second, because there was a significant drop in use. Let's see if we can piece something together from the coincidental knowledge I have of the history of my own life.

I'm also interested in the pattern we see in NE, with consistent dips and rises, which I guessed were from the summer, but I want to be sure of that. We're also going to look at NM on its own, since it's hard to get any idea of it when it's mixed in with the others.

A new, but very similar graph, is the ECDF plot below. This tells us the proportion of each inbox received over time. This one was really interesting to me as I looked at the behavior of the three dominant addresses. I describe it as a "swap" of the addresses before, but I hadn't realized how apt a description that was yet. In this graph, we can see an almost exact inversion of JG to GD, while NE parallels GD almost exactly, despite being introduced much later.
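To make "proportion of each inbox received over time" concrete, here is a small sketch that computes the same quantity by hand rather than plotting it. The column names `address` and `dateTime` and the helper `ecdf_by_address` are assumptions; the plot itself could be drawn with something like seaborn's `ecdfplot` using the address as the hue, though I'm not claiming that's exactly how the notebook does it.

```python
import pandas as pd


def ecdf_by_address(
    df: pd.DataFrame, time_col: str = "dateTime", group_col: str = "address"
) -> pd.DataFrame:
    """For each message, the fraction of its inbox's mail received up to and including it."""
    out = df.sort_values(time_col).copy()
    rank = out.groupby(group_col).cumcount() + 1  # 1, 2, 3, ... within each inbox
    size = out[group_col].map(out[group_col].value_counts())  # total mail per inbox
    out["ecdf"] = rank / size
    return out


sample = pd.DataFrame(
    {
        "address": ["JG", "JG", "JG", "JG", "GD", "GD"],
        "dateTime": pd.to_datetime(
            ["2016-01-01", "2016-06-01", "2017-01-01", "2021-01-01", "2019-01-01", "2021-06-01"]
        ),
    }
)
curves = ecdf_by_address(sample)
```

In this toy data, JG has already received three quarters of its mail by early 2017, while GD doesn't start until 2019: the curve for an inbox that is winding down flattens out, and the curve for one that is ramping up stays low and then climbs late.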

We can equate these lines to inbox use over time -- and we can see clearly the behavior we described in the first version of this project, how JG has leveled off while GD and NE have increased exponentially. This makes a lot of sense, given the events in the world at the time -- I made a new email address with the intent of making it my main one, and it's taken about five years but JG's use has finally started to crest, as the swap is nearing completion -- which, again, tracks, as I barely use that email anymore, I never sign up with it and have swapped all other accounts off of it to GD.

NE's behavior I can't really explain. I don't know how it parallels GD so well, given they're completely independent and receive very different emails. My hypothesis is that the lack of school emails in the summer creates the plateaus you can see on the plot, which offsets it just enough to make it non-linear and in line with the exponential growth of GD.

Again, we can point out the weird behavior in 2018 with the next two graphs:

Here we see what we would expect given the shape of the other ECDF graphs.

But if we look a little closer, we can point out only two events visible on this graph. The first is when I started to use JG for more than just a fake Facebook account: in 2016, you see a sharp increase, which you would expect from the email going from unused to used. The only other event we can see, however, isn't the introduction of one of the other three emails -- it's the dip from JG in 2018.

The same phenomenon can be seen on the next graph. Once my inbox starts getting consistent use in 2016, it's a linear relationship for the most part, with emails being weighted slightly more heavily towards the end of the year. Except in 2018. We again see the dip reflected, the only thing like it.

Finally, let's look at that period:

I won't lie, I was confused by this. Despite that encyclopedic knowledge of my life, I really don't understand how it's possible I got 0 emails to JG for an entire month. I can only see two ways this is possible:

  1. I permanently deleted all of those emails and just those emails. I know I didn't drop them while imputing or when I dropped the extra addresses.

    These extra addresses are just from spam emails spoofing me I think.

  2. It's an error on Gmail's end, and they don't have a record of my emails from that month.

I think it's more likely I deleted them, but I'm not sure.
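One way to double-check a gap like this is to count messages per calendar month for a single inbox and list the months that come up empty. This is only a sketch: the column names and the helper `empty_months` are assumptions, and the dates are toy data, not my real inbox.

```python
import pandas as pd


def empty_months(
    df: pd.DataFrame, address: str, time_col: str = "dateTime", group_col: str = "address"
) -> list:
    """List the months between an inbox's first and last message that have no mail."""
    times = df.loc[df[group_col] == address, time_col]
    months = times.dt.to_period("M")
    counts = months.value_counts()
    # Months with no mail never appear in value_counts, so rebuild the full range.
    full_range = pd.period_range(months.min(), months.max(), freq="M")
    counts = counts.reindex(full_range, fill_value=0)
    return [str(month) for month in counts[counts == 0].index]


log = pd.DataFrame(
    {
        "address": ["JG"] * 4 + ["GD"],
        "dateTime": pd.to_datetime(
            ["2018-01-10", "2018-02-03", "2018-04-20", "2018-05-01", "2018-03-15"]
        ),
    }
)
gaps = empty_months(log, "JG")
```

Running the same check on the data before and after the cleanup steps would show whether a gap was introduced along the way or was already there in what the API returned.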

Moving on, let's talk about NE for a little bit.

NE has about sixteen-thousand rows, but from our distribution plot above, we know that there's a pattern to the emails on that account. Let's look at it by itself.

With a closer look we can definitely tell that the pattern dips in the summer. This is, presumably, because Oxy mostly stops sending me emails over the summer, so whatever is left should be non-Oxy emails. We should be able to filter those out by excluding any row whose from address contains oxy.edu, which should hopefully leave us with an even distribution of emails through the year.

We can start to see here how Oxy (the back plot) and non-Oxy (the front plot) email distribution compares. A much more consistent distribution, but we're still getting a little bit of the dipping, so we'll also look at the top couple of addresses to make sure we're not missing any Oxy emails that may have come through under another domain.

And here we can see things like Handshake, Google Classroom and the Post Office still came through, as well as emails from Oxy faculty with personal email addresses, so let's adjust our parameters and shuffle some messages around.
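The split and the "who's left?" check can be sketched like this. It assumes the sender is stored in a column named `from`; the extra sender strings and all of the sample senders are made up for illustration, since the notebook's actual list isn't shown here.

```python
import pandas as pd

# Hypothetical extra senders to treat as school mail; not the notebook's real list.
EXTRA_SCHOOL_SENDERS = ["handshake", "classroom"]


def is_school_mail(
    from_col: pd.Series, domain: str = "oxy.edu", extra=EXTRA_SCHOOL_SENDERS
) -> pd.Series:
    """True where the sender contains the school domain or one of the extra strings."""
    sender = from_col.fillna("").str.lower()
    mask = sender.str.contains(domain, regex=False)
    for needle in extra:
        mask |= sender.str.contains(needle, regex=False)
    return mask


# Made-up senders, for illustration only.
ne = pd.DataFrame(
    {
        "from": [
            "Registrar <registrar@oxy.edu>",
            "Handshake <handshake@mail.example>",
            "Some Store <deals@shop.example>",
            "Some Store <deals@shop.example>",
            None,
        ]
    }
)
school = is_school_mail(ne["from"])
oxy, non_oxy = ne[school], ne[~school]

# The most frequent remaining senders, to spot school mail hiding under other domains.
top_senders = non_oxy["from"].value_counts().head(10)
```

Looking at `top_senders` after each pass is what makes it easy to keep adjusting the list until only genuinely non-school senders are left.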

Tada! We have our expected distribution! Our non-Oxy messages form a nice even distribution over time, exactly what we were expecting to see. Similarly, we're going to grab the sender addresses for my main address, GD, and take a look at the more interesting ones.

Here we can see a number of addresses, pulled out and highlighted over time. My favorite of all of these (and by favorite I mean the one that makes me gag a little) is Drop.com. Drop is an online marketplace that sells hobbyist and enthusiast gear. During the pandemic, they grew pretty big and expanded a lot. At the same time, we see on the distribution plot that during the height of people's panic about the pandemic, early to late 2020, they significantly increased their marketing. During a time when millions of people were struggling to afford to live and many were being laid off and prioritizing necessities, Drop ramps up advertising and marketing and pushes their products as hard as ever. It leaves a bad taste in the mouth. At the same time, we see sites like eBay reduce the number of marketing emails they send, as you can see in its dip, and stores like Nordstrom completely drop their email marketing campaigns encouraging people to spend money they don't have. The New York Times and YouTube, who have both done an enormous amount of coverage on the pandemic, start sending out more newsletters as the pandemic crests. An interesting comparison of corporate priorities during a crisis.

Finally, I want to spend a short amount of time looking at the label distribution. We could say a lot about them, but I want to keep it brief; they're pretty and speak for themselves in a lot of ways, I think. The most interesting of these, though, is the spam plot. The spam plot only shows data from 2022. To me, this either means the spam label wasn't created until then or, more likely I think, the spam gets deleted every so often, which would also explain the shape of the graph. It's also interesting to see where the different accounts differ significantly in the types of emails they get, like how NE receives almost all the Forum, Personal, and Important emails, which makes sense given it's my school email. You can also see, in many of these, the drop-off that came from the switch away from JG, and other accounts picking up the slack.

Conclusion

All in all, I'm happy with where we've found ourselves. We've taken a good, hard look at the history of my inbox, visualizing different trends and anomalies to get a clearer idea of what happened, how we got there, and the effects afterwards. It also gave a clearer idea of what is actually in my inbox and who sends me emails. I've examined the ways that some individual companies have acted during the pandemic with regard to emails, trying to understand better how companies use marketing differently at different times to achieve different results.

My inbox provided a lot of information -- and though I think that I'm ready to put this project behind me for now, I do want to come back to it in the future, do the same collections and see what new insights I can come up with years from now.